﻿* Encoding: UTF-8.
*Replication File
    
* the dataverse does not accept .sav files well, turns them into .tab files and we found out during proofing 
* that the dataverse truncates a .sav when it turns into .tab it appears, so using .csv instead. This might lose some information like value labels etc.

*if a . Sav file then enter the location you store the data file in
Get FILE=' ENTER FILE LOCATION\PhDMatters011025.sav'.

*If a .csv file then the following should import it.AFTER the /FILE is edited to the location of where user stores the .CSV file  (replaces ENTER FILE LOCATION)
GET DATA  /TYPE=TXT
  /FILE="ENTER FILE LOCATION\PhDMatters011025.csv"
  /DELCASE=LINE
  /DELIMITERS=","
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /IMPORTCASE=ALL
  /VARIABLES=
  InstitutionName A51
  Department A50
  sex_num F1.0
  YearPHDearnedGraduated F4.0
  PHDEarnedfrom A51
  PHDEarnedIn A45
  DegreeProgramShort F1.0
  DegreeProgramShortwMissing F1.0
  InternationalPhD F1.0
  Rank_Num F1.0
  YearMadeProfessor F4.0
  YearMadeAssociate F4.0
  CareerLength F2.0
  CareerAssociateYears F2.0
  CareerAssociate F1.0
  Years2Assoc F2.0
  Years2Full F2.0
  Undergrad A53
  Typeofundergrad A13
  Private_vs_Public_only F1.0
  PublicorPrivateundergradRaw A13
  Citations_num F6.0
  HIndex_num F3.0
  Avg_Citations_phdearnedFrom F16.12
  Avg_Citations_CurrentInstitution F16.12
  Avg_hindex_phdearnedFrom F16.14
  Avg_hindex_CurrentInstitution F16.14
  RankbyGraduatingPhDFrom F2.0
  RankbyGraduatingCurrentlyAt F2.0
  Movement_All F3.0
  Avg_Career_CurrentInstitution F16.13
  AgeAdjCitations F16.14.
*CACHE.
*EXECUTE.
*DATASET NAME DataSet3 WINDOW=FRONT.
    
* TABLE S1 

FREQUENCIES VARIABLES=sex_num YearPHDearnedGraduated DegreeProgramShort DegreeProgramShortwMissing InternationalPhD 
    Rank_Num YearMadeProfessor YearMadeAssociate CareerLength CareerAssociate Years2Assoc Years2Full 
    Citations_num hindex_num Movement_All RankbyGraduatingPhDFrom RankbyGraduatingCurrentlyAt
  /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN
  /ORDER=ANALYSIS.

USE ALL.
EXECUTE.

*TABLE 1 

FREQUENCIES VARIABLES=
InternationalPhD
PHDEarnedfrom
PHDEarnedIn
DegreeProgramShort 
Undergrad
sex_num
Typeofundergrad
YearPHDearnedGraduated 
YearMadeProfessor
YearMadeAssociate
Undergrad
Rank_Num
Citations_num
hindex_num
sex_num
 /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN
  /ORDER=ANALYSIS.


* Private vs PublicUG,  NO TABLE the OUTPUT is NOTED IN THE MAIN TEXT 

FREQUENCIES VARIABLES=Private_vs_Public_only  
PublicorPrivateundergradRaw  Undergrad
  /ORDER=ANALYSIS.

* percent men women last NO TABLE the OUTPUT is NOTED IN THE MAIN TEXT 

USE ALL.
COMPUTE filter_$=(YearPHDearnedGraduated>2012).
VARIABLE LABELS filter_$ 'YearPHDearnedGraduated>2012 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=sex_num 
  /ORDER=ANALYSIS.

*TABLE 2

USE ALL.
EXECUTE.

****FULL SAMPLE  for TABLE 2

FREQUENCIES VARIABLES=rank_num CareerAssociateYears YearMadeAssociate CareerLength Years2Assoc
Years2Full
 /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.

FREQUENCIES VARIABLES=CareerAssociate
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

*double checking numbers

USE ALL.
COMPUTE filter_$=(Rank_Num=2).
VARIABLE LABELS filter_$ 'Rank_Num=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=rank_num  YearMadeAssociate
  /ORDER=ANALYSIS.

*946 is number that are still associate, 163 missing year, so number is 783- the exact number above

USE ALL.
EXECUTE.


* Sig dif for TABLE 2
    

T-TEST GROUPS=sex_num(0 1)
  /MISSING=LISTWISE
  /VARIABLES=CareerLength
  /ES DISPLAY(TRUE)
  /CRITERIA=CI(.95).

T-TEST GROUPS=sex_num(0 1)
  /MISSING=LISTWISE
  /VARIABLES=Years2Assoc
  /ES DISPLAY(TRUE)
  /CRITERIA=CI(.95).

T-TEST GROUPS=sex_num(0 1)
  /MISSING=LISTWISE
  /VARIABLES=Years2Full
  /ES DISPLAY(TRUE)
  /CRITERIA=CI(.95).

* both vars are binary, so violates assumption of T-Test, although T-Test is pretty robust to these things, see the chi sq test

T-TEST GROUPS=sex_num(0 1)
  /MISSING=LISTWISE
  /VARIABLES=CareerAssociate
  /ES DISPLAY(TRUE)
  /CRITERIA=CI(.95).

CROSSTABS
  /TABLES=sex_num BY CareerAssociate
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ 
  /CELLS=COUNT 
  /COUNT ROUND CELL.


**** MEN ONLY FOR TABLE 2
    
USE ALL.
COMPUTE filter_$=(sex_num=1).
VARIABLE LABELS filter_$ 'sex_num=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=rank_num CareerAssociateYears YearMadeAssociate CareerLength Years2Assoc
Years2Full
 /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.

FREQUENCIES VARIABLES=CareerAssociate
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.



USE ALL.
EXECUTE.

****WOMEN ONLY FOR TABLE 2
    
 USE ALL.
COMPUTE filter_$=(sex_num=0).
VARIABLE LABELS filter_$ 'sex_num=0 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=rank_num CareerAssociateYears YearMadeAssociate CareerLength Years2Assoc
Years2Full
 /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.

FREQUENCIES VARIABLES=CareerAssociate
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.



USE ALL.
EXECUTE.

 



* ALL PLACEMENTS -SI TABLE S2

FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.



** TABLE 3 A-D the big table- including only US and only PS. 


USE ALL.
EXECUTE.

*include all dept missing as PS and and all international missing as US


FREQUENCIES VARIABLES=InternationalPhD
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

*this ensures we include the 4 that are missing where PhD from

recode InternationalPhD (sysmis=0).
execute.


FREQUENCIES VARIABLES=InternationalPhD
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShortwMissing=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

** output for TABLE 3


FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.


* calculates for last 35 years column for TABLE 3 - THIS IS FOR the YEARLY AVERAGE COLUMN in TABLE 3

USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1 and YearPHDearnedGraduated>1987).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShortwMissing=1 and last 35 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.



* TABLE S3 SEPERATE BY SEX
    

**** Men and US Only  FOR TABLE S3
    

USE ALL.
EXECUTE.
    
USE ALL.
COMPUTE filter_$=(sex_num=1 and InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'sex_num=1 and InternationalPhD=0 and DegreeProgramShortwMissing=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.
  
USE ALL.
EXECUTE.




****Women and US Only  FOR TABLE S3
    
USE ALL.
COMPUTE filter_$=(sex_num=0 and InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'sex_num=0 and InternationalPhD=0 and DegreeProgramShortwMissing=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.


FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.


**TABLES  S4 and S5- US, PS, LAST 10 and used to create FIGURE 1.
*S4 uses the list of 122 as well. Those not in the data are  the ones with 0 placements.


USE ALL.
EXECUTE.


USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1 AND YearPHDearnedGraduated>2012).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.


FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

*653

* TABLE S6 is generated via spreadsheet from BOTH THE Table above and the following table

USE ALL.
EXECUTE.


* this creates a table of all graduates placed that graduated before 2013 in order to compare to those after 2012 

USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1 AND YearPHDearnedGraduated<2013).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 between (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.


FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.



*1945 + 653 = 2598 - missing 63 due to missing year phd graduated

* TABLE S7 Last 10 Men and Women Sep
    
USE ALL.
EXECUTE.


* WOMEN ONLY TABLE S7

USE ALL.
COMPUTE filter_$=(sex_num=0 and InternationalPhD=0 and DegreeProgramShortwMissing=1 AND YearPHDearnedGraduated>2012).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

*N=298

FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

*MEN ONLY TABLE S7
    
USE ALL.
EXECUTE.


USE ALL.
COMPUTE filter_$=(sex_num=1 and InternationalPhD=0 and DegreeProgramShortwMissing=1 AND YearPHDearnedGraduated>2012).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

*N=355

FREQUENCIES VARIABLES=PHDEarnedfrom
  /FORMAT=DFREQ
  /ORDER=ANALYSIS.

***  SI TABLE S8 and FIGURE 2 - which is generated via excel


USE ALL.
EXECUTE.

*dont need to restrict to US only since this is enforced as non US schools have no movement calculated


FREQUENCIES VARIABLES= Movement_All
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.



*  THE FOLLOWING OUTPUT IS USED TO CREATE FIGURE S1 Movement by gradates placed minus top 10 programs

** removing top 10 programs from movement 

*TABLE S9

USE ALL.
COMPUTE filter_$=(RankbyGraduatingPhDFrom>10).
VARIABLE LABELS filter_$ 'RankbyGraduatingPhDFrom>10 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.


* checking data - n =1354

FREQUENCIES VARIABLES= PHDEarnedfrom RankbyGraduatingPhDFrom
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.

FREQUENCIES VARIABLES= Movement_All
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.


USE ALL.
EXECUTE.

FREQUENCIES VARIABLES= PHDEarnedfrom RankbyGraduatingPhDFrom
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.


USE ALL.
EXECUTE.

*include all dept missing as PS and US

USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShortwMissing=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

* TABLE 4 


CORRELATIONS
  /VARIABLES=
RankbyGraduatingPhDFrom 
RankbyGraduatingCurrentlyAt 
Years2Assoc
Years2Full
Citations_num
hindex_num
sex_num
CareerLength
Private_vs_Public_only
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.


* unlike SAS,  SPSS 23 does not natively estimate polychor/tetrachoric correlations, so including the following macro - this file on the dataverse needs to be added into a local directory 

include 'ENTER FILE LOCATION\r_tetra.sps'.
SET MXLOOP=2700.


tetcorr sex_num Private_vs_Public_only.

FREQUENCIES VARIABLES= Citations_num
hindex_num
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.



* correlation between RANK FROM AND RANK AT- NO TABLE the OUTPUT is NOTED IN THE MAIN TEXT 


CORRELATIONS
  /VARIABLES= RankbyGraduatingPhDFrom RankbyGraduatingCurrentlyAt 
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.


* correlation between dept avg citations and rank - NO TABLE the OUTPUT is NOTED IN THE MAIN TEXT 

USE ALL.
EXECUTE.

USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

CORRELATIONS
  /VARIABLES=
Avg_Citations_CurrentInstitution
Avg_hindex_CurrentInstitution
RankbyGraduatingCurrentlyAt 
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

* TABLE S11
    
FREQUENCIES VARIABLES= AgeAdjCitations
  /STATISTICS=STDDEV RANGE MINIMUM MAXIMUM MEAN MODE
  /ORDER=ANALYSIS.

USE ALL.
EXECUTE.



* TABLE S10

USE ALL.
EXECUTE.


USE ALL.
COMPUTE filter_$=(InternationalPhD=0 and DegreeProgramShortwMissing=1).
VARIABLE LABELS filter_$ 'InternationalPhD=0 and DegreeProgramShort=1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.

*TABLE S10 output

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN 
  /DEPENDENT RankbyGraduatingCurrentlyAt 
  /METHOD=ENTER RankbyGraduatingPhDFrom Citations_num hindex_num sex_num  
    CareerLength Private_vs_Public_only.






